Adaptive Filtering for Efficient Record Linkage
Abstract
Record linkage, the process of identifying record pairs that represent the same real-world entity across multiple databases, is an important initial step in many data mining applications. Linking millions of records is computationally expensive, so record linkage systems use various blocking methods to reduce the number of record pairs that must be compared. A good blocking key is critical to the success of a blocking method and ideally produces many small blocks. In practice, however, large blocks almost always occur no matter how good the blocking key is. For example, when blocking on surname for an Anglo-Celtic population, Smith and Taylor are very common and produce very large blocks. These large blocks hinder the efficiency of a blocking method because their sizes dominate the resulting number of record pairs. In this paper, we present a filtering algorithm that post-processes large blocks to improve blocking efficiency. Experimental results show that our filtering algorithm reduces the number of record pairs produced by the standard blocking method by 88% on a small real-world data set. The algorithm also reduces the number of record pairs generated by a 3-pass standard blocking method by 50% on several synthetic test data sets, with minimal loss of accuracy.
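The effect of large blocks on the pair count can be illustrated with a minimal Python sketch of standard blocking on surname. It is not the filtering algorithm from the paper, which the abstract does not specify; the record layout, the toy data, and the helper names `build_blocks`, `pair_count`, and `candidate_pairs` are assumptions made for illustration only. A block of n records yields n(n-1)/2 candidate pairs, so a single populous surname accounts for most of the comparisons.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(records, key):
    """Group records by blocking key (hypothetical helper, standard blocking)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return dict(blocks)


def pair_count(block_size):
    """Number of unordered record pairs inside one block: n * (n - 1) / 2."""
    return block_size * (block_size - 1) // 2


def candidate_pairs(blocks):
    """Yield every within-block record pair that would be compared."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Toy data (made up): one populous surname and several rare ones.
records = (
    [{"id": i, "surname": "Smith"} for i in range(6)]
    + [{"id": 6, "surname": "Taylor"}, {"id": 7, "surname": "Taylor"}]
    + [{"id": 8, "surname": "Nguyen"}, {"id": 9, "surname": "Okafor"}]
)

blocks = build_blocks(records, key=lambda r: r["surname"])
per_block = {name: pair_count(len(members)) for name, members in blocks.items()}
total_pairs = sum(per_block.values())

# Pairs that would be compared without any blocking at all.
all_pairs = pair_count(len(records))

for name, count in sorted(per_block.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {count} pairs")
print(f"blocked total: {total_pairs} of {all_pairs} possible pairs")
```

In this toy data the Smith block alone contributes 15 of the 16 blocked pairs, which is the kind of dominance that motivates post-processing large blocks rather than relying only on a better blocking key.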
Similar resources
A Family of Selective Partial Update Affine Projection Adaptive Filtering Algorithms
In this paper we present a general formalism for the establishment of the family of selective partial update affine projection algorithms (SPU-APA). The SPU-APA, the SPU regularized APA (SPU-R-APA), the SPU partial rank algorithm (SPU-PRA), the SPU binormalized data reusing least mean squares (SPU-BNDR-LMS), and the SPU normalized LMS with orthogonal correction factors (SPU-NLMS-OCF) algorithms...
Adaptive-Filtering-Based Algorithm for Impulsive Noise Cancellation from ECG Signal
Suppression of noise and artifacts is a necessary step in biomedical data processing. Adaptive filtering is known as a useful method for overcoming this problem. Among the various contaminants, there are situations in which the electrical activity of muscles contributes impulsive noise. This paper deals with modeling real-life muscle noise with α-stable probability distribution and adaptive filterin...
An Adaptive Hierarchical Method Based on Wavelet and Adaptive Filtering for MRI Denoising
MRI is one of the most powerful techniques for studying the internal structure of the body. MRI image quality is affected by various kinds of noise. Noise in MRI is usually thermal and mainly due to the motion of charged particles in the coil. Noise in MRI images also limits both visual study and computer analysis of the images. In this paper, first, it is proved that proba...
Speech Enhancement by Modified Convex Combination of Fractional Adaptive Filtering
This paper presents new adaptive filtering techniques for use in a speech enhancement system. Adaptive filtering schemes are subject to different trade-offs regarding their steady-state misadjustment, speed of convergence, and tracking performance. Fractional Least-Mean-Square (FLMS) is a new adaptive algorithm which has better performance than the conventional LMS algorithm. Normalization of LMS ...
Probabilistic Linkage of Persian Record with Missing Data
Extended Abstract. When comprehensive information about a topic is scattered across two or more data sets, using only one of them loses the information available in the others. Hence, it is necessary to integrate the scattered information into a single comprehensive data set. On the other hand, sometimes we are interested in recognizing duplicates in a data set. The i...
Publication date: 2004